Search Result

Journals

Publication Years

Keywords

Please wait a minute...

For Selected:

Download Citations
EndNote Ris BibTeX

Toggle Thumbnails

Select

Parallel fuzzy C-means clustering algorithm in Spark

WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli

Journal of Computer Applications 2016, 36 (2): 342-347. DOI: 10.11772/j.issn.1001-9081.2016.02.0342

Abstract （1114）

PDF （901KB）（1347）

Save

With the growing data volume and timeliness requirement, the clustering algorithms need to be adaptive to big data and higher performance. A new algorithm named Spark Fuzzy C-Means (FCM) was proposed based on Spark distributed in-memory computing platform. Firstly, the matrix was partitioned into vector set horizontally and distributedly stored, which meant different vectors were distributed in different nodes. Then based on the characteristics of FCM algorithm, matrix operations were redesigned considering distributed storage and cache sensitivity, including multiplication, addition and transpose. Finally, Spark-FCM algorithm which combined with matrix operations and Spark platform was implemented. The primary data structures of the algorithm adopted distributed matrix storage with fewer moving data between nodes and distributed computing in each step. The test results in stand-alone and cluster environments show that Spark-FCM has good scalability and can adjust to large-scale data sets, the performance and the size of data shows a linear relationship, and the performance in cluster environment is 2 to 3 times higher than that in stand-alone.

Reference | Related Articles | Metrics